Abstract

Large graphs arise in a number of contexts, and understanding their structure and
extracting information from them is an important research area. Early algorithms
for mining communities have focused on the global structure, and often run in time
that depends on the size of the entire graph. Nowadays, as we often explore networks
with billions of vertices and find communities with only hundreds of members, it is crucial to shift our
attention from macroscopic structure to microscopic structure in large networks. A
growing body of work has adopted local expansion methods in order to identify
the community members from a few exemplary seed members.

In this paper, we propose a novel approach for finding overlapping communities 
called Lemon (Local Expansion via Minimum One Norm). The algorithm finds the 
community by seeking a sparse vector in the span of the local spectra such that the 
seeds are in its support. We show that Lemon can achieve the highest detection 
accuracy among state-of-the-art proposals. The running time depends on the size 
of the community rather than that of the entire graph. The algorithm is easy to 
implement, and is highly parallelizable. We further provide a theoretical analysis of the
local spectral properties, bounding the measure of tightness of the extracted community
in terms of the eigenvalues of the graph Laplacian.

Moreover, given that networks are not all similar in nature, a comprehensive analysis
of how the local expansion approach is suited for uncovering communities in different
networks is still lacking. We thoroughly evaluate our approach using both synthetic
and real-world datasets across different domains, and analyze the empirical variations
when applying our method to inherently different networks in practice. In addition,
heuristics on how the quality and quantity of the seed set affect the performance
are provided.


1 Introduction 

Analyzing the structure and extracting information from complex networks is an important 
research area. Significant research has been carried out in finding the structure of networks 
and identifying communities [7]. 

In early work, researchers assumed that communities were disjoint and had more internal
connections than external connections. Both assumptions have been discarded since
it is clear that in most networks a vertex belongs to more than one community. For instance,
in social networks, one might belong to a work community, a community of friends,
and a community of individuals who share the same hobby such as golf; in co-purchase
networks, one item might belong to multiple categories. Also, since we are dealing with
networks with hundreds of millions of vertices, an individual in a community of size 100
will certainly have more links outside the community than inside. These key insights have
motivated us to identify communities from a new perspective.

Considerable research on detecting communities has focused on the global structure.
These globally based detection algorithms usually run in time that depends on the size of
the entire graph, a major drawback in computational cost. Nowadays, we explore networks
with billions of vertices to find communities of about a hundred members. Thus, taking the entire
graph into account might not be a practical solution in many situations. It is therefore
crucial to shift our attention from global structure to local structure in large networks, and to
develop new approaches that find communities in time that depends on the size of
the community.

Quite recently, there has been a growing interest in finding communities by locally expanding
an exemplary seed set in the community of interest. This type of
algorithm usually starts with a few members that are already known to be in the target
community, and the goal is to uncover the remaining members of the same community as the
exemplary members. These known members are usually referred to as seeds in the literature,
and the process of gradually growing the seed set into a larger set until the target
community is revealed is called seed set expansion. The setting of seed set expansion
applies to many real-world applications. For example, in web search, with a few known
pages that share similar information, we could generate a larger group of web pages that
contain relevant content with respect to a certain search query; in product networks,
seed set expansion enables the automatic categorization of products that are discovered to
be in the same community as the labeled items.

The random walk technique has been extensively adopted as a subroutine for locally
growing the seed set in the literature [9] [23] [25] [27]. The dynamics of random
walks are effective in finding a local community since they make non-uniform expansion
decisions based on the structure revealed during the exploration of the neighborhood surrounding
the seeds [3]. This implies that random walk based local expansion is able to
trace the community members in a principled way that best resembles the natural process
of forming the local community structure. Very recently, Abrahao et al. also experimentally
verified that random walks produce communities that are most structurally similar
to real-world communities amongst various algorithmic communities [1].

A statistical study on social networks done by Leskovec et al. has shown that real-world communities
with high quality are quite small and usually consist of no more than 100 vertices.

In this paper we propose a novel approach called
Lemon (Local Expansion via Minimum One Norm) for finding overlapping communities
in large networks. We systematically demonstrate that Lemon achieves both high
efficiency and effectiveness, standing out amongst state-of-the-art proposals.
Specifically, we consider the span of a few dimensions of vectors after a short random
walk and use it as the approximate invariant subspace, which we refer to as the local spectra.

In contrast to the traditional spectral clustering methods, our local spectral method
does not require the burdensome computation of a large number of singular vectors. In
addition, as traditional spectral methods usually partition the vertices into disjoint communities,
we make another fundamental change. Concretely, we mine the communities
from the subspace by seeking a sparse approximate indicator vector in the span of the local
spectra such that the seeds are in its support. In practice, this can be
achieved by solving an $\ell_1$-penalized linear programming problem.

We aim to develop a comprehensive understanding of the local spectral approach for
identifying a community from a small seed set. Following the central idea of our approach,
we seek to answer fundamentally important questions such as: what defines “good” communities
and when do they emerge as we expand the seed set (Section 4.5)? How can we find a
small community in time that depends on the size of the community rather than that of the
entire graph (Section 4.4)? What defines “good” seeds and how many seeds could uniquely
define a community?

And given that networks are not all similar in nature, how well is the local expansion approach
suited for uncovering communities in different types of networks (Section 6.3)?

We thoroughly evaluate our approach using both synthetic and real-world datasets 
across different domains, and analyze the empirical variations when applying our method 
to inherently different networks in practice. We believe that the insights gained from
studying these problems will provide valuable guidance for future investigation of
this topic.


Our demo code is publicly available at: https://github.com/yixucinli/lemon

2 Related Work


A considerable amount of literature has been published on finding communities in large 
social and information networks. We highlight a few ideas that have recently emerged in 
the literature to clarify how our method differs. 

Globally based community finding algorithms. Various community detection
algorithms have been developed in the past decade, and most of them fall into the
category of global approaches. One stream of global algorithms attempts to find communities
by optimizing an objective function. For example, GCE identifies maximal cliques as
seed communities and expands these cliques by greedily optimizing a local fitness function.
OSLOM [13] is also based on the optimization of a fitness function, which expresses the
statistical significance of clusters with respect to random fluctuations (i.e., the random
graph generated by the configuration model during community expansion). However,
the communities identified by mathematical construction may structurally diverge from
real communities, as pointed out in [1]. Another main stream of research adopts the label
propagation approach [22], which defines rules that simulate the spread of labels of vertices
in the network. The DEMON algorithm, for example, democratically lets each vertex
vote for the communities it sees surrounding it in its limited view of the global system
using a label propagation algorithm, and then merges the local communities into a global
collection. Other approaches such as Link Community (LC) [2] partition the graph by first
building a hierarchical link dendrogram according to the link similarity and then cutting
the dendrogram at some threshold to yield link communities.

Random walk based detection algorithms. As noted in the preceding section,
among the divergent approaches, random walks tend to reveal communities that bear the
closest resemblance to the ground truth communities in nature [1]. In the following, we
briefly review some methods that have adopted the random walk technique in finding
communities. Among methods that focus on the global structure, Pons et al. [21]
proposed a hierarchical agglomerative algorithm, WalkTrap, that quantifies the similarity
between vertices using random walks and then partitions the network into non-overlapping
communities. Meila et al. presented a clustering approach that views the pairwise
similarities as edge flows in a random walk and studies the eigenvectors and eigenvalues of the
resulting transition matrix. A later successful algorithm, Infomap, proposed by Rosvall
& Bergstrom [23], uncovers hierarchical structures in networks by compressing
a description of a random walker as a proxy for real flow on networks. Variants of this
technique such as the biased random walk [28] have also been employed in community finding.

Local expansion based approaches. In interpreting the problem of community detection
from a local perspective, our work shares the same spirit as earlier local expansion
algorithms. Specifically, Andersen & Lang adapted the theoretical
results from [24] to expand a set into a community with locally minimal conductance
based on lazy random walks.
However, the lazy random walk suffers from a much slower mixing speed and usually takes
more than 500 steps to converge to a local structure, compared with the several steps
of rapid mixing in a regular random walk. Focusing on seeding strategies, Whang
et al. [25] established several sophisticated methods for choosing the seed set, and then
used a PageRank scheme similar to that in [3] to expand the seeds until a community with
optimal conductance is found. Nonetheless, the performance gained by adopting these
intricate seeding methods was not significantly better than that obtained by using random seeds.
This implies that a better scheme for expanding the seeds is also needed aside from a good
seeding strategy. A recent work by Kloumann & Kleinberg provided a systematic
understanding of variants of PageRank-based seed set expansion. They reported many
insightful findings regarding the heuristics on the seed set. However, the lack of
a proper stopping criterion has limited its functionality in practice. Even though a recently
proposed heat kernel algorithm [10] advances PageRank by introducing a sophisticated
diffusion method, the detection accuracy achieved by the heat kernel approach is still much
lower than that of Lemon, which we will show in Section 6.1.

Local spectra vs. global spectra. Spectral methods are among the most widely used
techniques for exploratory data analysis, with applications ranging from data clustering and
image segmentation to community detection. Spectral clustering makes use of the first
few singular vectors of the Laplacian matrix associated with a graph, which are inherently
global quantities and may not be sensitive to very local information. For example,
when provided with domain knowledge about a target region in the graph,
one might be interested in finding clusters only near the specified local region in a semi-supervised
manner, which might not be well captured by a method using global
eigenvectors. Therefore, in the semi-supervised setting, our pioneering work on local spectral
clustering has a substantial advantage over traditional spectral techniques, with the
capability of prioritizing and learning more about a local region of the graph surrounding
the seeds. Although an existing local spectral proposal incorporates the local information as
an additional constraint on top of global spectral methods, its optimization program
involves the entire eigenspace, which is less advantageous than using the partial invariant
subspace constructed from the Krylov subspace in our approach.


3 Preliminaries

3.1 Problem Statement 

Given a network $G = (V, E)$ and a set of members $S$ in the target community $C$, where
$|C| \ll |V|$ and $|S| \ll |C|$, we are interested in discovering the remaining members in $C$.
Generally speaking, we focus on answering the question: how can we accurately find a small community
from a seed set in time that depends on the size of the community?

This manuscript is an extended version of an earlier conference publication.

3.2 Symbols and Definitions

Table 1 summarizes the symbols we use throughout the paper. In
general, we use italic letters, e.g., $n$, $\mu$, to denote scalars; lowercase boldface characters, e.g., $\mathbf{y}$,
to denote vectors; uppercase boldface characters, e.g., $\mathbf{A}$, to denote matrices; and script
characters, e.g., $\mathcal{C}$, to denote sets.


Symbol                    Definition and description
$S$                       Seed set
$C$                       Detected community
$C^*$                     Ground truth community
$G_s$                     Subgraph extracted from the neighborhood surrounding the seed set $S$
$N$                       Size of the subgraph $G_s$
$\mathbf{A}_s$            Adjacency matrix of subgraph $G_s$
$\bar{\mathbf{A}}_s$      Normalized adjacency matrix of subgraph $G_s$
$\mathbf{D}_s$            Diagonal degree matrix of subgraph $G_s$
$\mathbf{L}_s$            Laplacian matrix of subgraph $G_s$
$\bar{\mathbf{L}}_s$      Normalized Laplacian matrix of subgraph $G_s$
$\mathbf{V}_{k,l}$        $l$-dimensional local spectral subspace with $k$-step random walks
$\Phi(V)$                 Conductance of the node set $V$
$\lambda_i(\mathbf{H})$   The $i$-th smallest eigenvalue of matrix $\mathbf{H}$
$\mathbf{y}$              Probability indicator vector, where a larger value indicates a higher
                          possibility of being in the same community as the seeds

Table 1: Symbols and Definitions.


3.3 Datasets 

3.3.1 Synthetic datasets 

The LFR benchmark graphs [12] have been widely adopted for evaluating the
performance of community detection algorithms. LFR datasets are generated with built-in
community structure that resembles the features found in most real-world networks with
power-law degree distribution. The benchmark gives researchers rich flexibility to control the
network topology by tuning different parameters, including the graph size $n$, the average
degree $k$, the maximum degree $k_{max}$, the minimum and maximum community size $|C|_{min}$
and $|C|_{max}$, the mixing parameter $\mu$, the overlapping membership $om$ and the number of
vertices with overlapping membership $on$. Among these parameters, the mixing parameter
$\mu$ has the most significant impact on the network topology: it controls the fraction of
links for each vertex that cross to a community with which the vertex is not associated.
Usually, a larger $\mu$ results in lower detection accuracy.
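As a rough illustration of what the mixing parameter measures, the sketch below computes, for each vertex of a toy graph, the fraction of its links that lead to a vertex sharing none of its communities. The function name `mixing_fractions`, the dictionary-based graph representation, and the rule used for overlapping memberships (a link counts as internal if the two endpoints share at least one community) are assumptions made for this example; the sketch does not reproduce the LFR generator itself.

```python
def mixing_fractions(adjacency, memberships):
    """Per-vertex fraction of links that leave all of the vertex's communities.

    adjacency:   dict mapping vertex -> set of neighboring vertices
    memberships: dict mapping vertex -> set of community labels
    """
    fractions = {}
    for v, neighbors in adjacency.items():
        if not neighbors:
            fractions[v] = 0.0
            continue
        # A link is external when the endpoints share no community (assumption).
        external = sum(
            1 for u in neighbors if not (memberships[v] & memberships[u])
        )
        fractions[v] = external / len(neighbors)
    return fractions


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the link (2, 3).
toy_adjacency = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
toy_memberships = {0: {"a"}, 1: {"a"}, 2: {"a"}, 3: {"b"}, 4: {"b"}, 5: {"b"}}
toy_fractions = mixing_fractions(toy_adjacency, toy_memberships)
```

In this toy graph only vertices 2 and 3 have a link that crosses communities, so their fraction is one third while all other vertices have fraction zero.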
Domain          Dataset    Vertices     Links          Average       Maximum       Community
                                                       membership    membership    size mean
Product         Amazon     334,863      925,872        0.11          49            39
Collaboration   DBLP       317,080      1,049,866      0.22          11            251
Social          YouTube    1,134,890    2,987,624      0.05          41            79
Social          Orkut      3,072,441    117,185,083    9.56          504           83

Table 2: Statistics for the real networks.


Xie et al. [26] have performed a thorough performance comparison of different state-of-the-art
overlapping community detection algorithms on LFR benchmark datasets. To
make the performance evaluation of our algorithm consistent with that in [26], we adopt
the same parameters in our paper. In total, we generate two sets of networks with mixing
parameter $\mu = 0.1$ and $\mu = 0.3$, respectively.
We vary the parameter $om$ from 2 to 8 for each $\mu$ and obtain a total of 14 networks.
Table 3 lists the values of the parameters we have used for generating the LFR datasets.


Parameter      Description                         Value
$n$            graph size                          5000
$\mu$          mixing parameter                    {0.1, 0.3}
$k$            average degree                      10
$k_{max}$      maximum degree                      50
$|C|_{min}$    minimum community size              20
$|C|_{max}$    maximum community size              100
$\tau_1$       node degree distribution exp.       2
$\tau_2$       community size distribution exp.    1
$om$           overlapping membership              {2, 3, ..., 8}
$on$           overlapping node                    2500

Table 3: Parameters for the LFR datasets.


3.3.2 Real datasets 

For the purpose of testing on real networks, we include four datasets with ground truth
community membership from the Stanford Network Analysis Project. These datasets span
various domains of network applications, including product networks (Amazon), collaboration
networks (DBLP), and online social networks (YouTube and Orkut). Each of the
networks can be viewed as an undirected, connected graph.
The statistical information of the datasets is summarized in Table 2.

http://snap.stanford.edu

For all four real datasets, we adopt the top 5000 communities that possess the highest quality.


3.4 Evaluation Metric 


For the evaluation metric, we adopt the $F_1$ score to quantify the similarity between the algorithmic
community $C$ and the ground truth community $C^*$. The $F_1$ score for each pair
$(C, C^*)$ is defined by:

$$F_1(C, C^*) = \frac{2 \cdot Precision(C, C^*) \cdot Recall(C, C^*)}{Precision(C, C^*) + Recall(C, C^*)}, \tag{1}$$

where the precision and recall are defined as:

$$Precision(C, C^*) = \frac{|C \cap C^*|}{|C|}, \tag{2}$$

$$Recall(C, C^*) = \frac{|C \cap C^*|}{|C^*|}. \tag{3}$$
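The metric above can be computed directly from two vertex sets. The following is a minimal sketch; the function name `f1_score` and the convention of returning 0 when the two sets do not intersect are choices made for this example.

```python
def f1_score(detected, truth):
    """F1 score between a detected community and a ground truth community."""
    detected, truth = set(detected), set(truth)
    overlap = len(detected & truth)
    if overlap == 0:
        # Precision and recall are both zero; define F1 as zero (assumption).
        return 0.0
    precision = overlap / len(detected)  # |C ∩ C*| / |C|
    recall = overlap / len(truth)        # |C ∩ C*| / |C*|
    return 2 * precision * recall / (precision + recall)


# Worked example: precision = 2/4, recall = 2/6, so F1 is approximately 0.4.
example_score = f1_score({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
```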

Throughout the paper, unless otherwise pointed out, the experimental results on synthetic
data for each instance are given by the statistical mean and standard deviation based
on 24 test cases, and the experimental results on real datasets for each instance are based
on 120 test cases. All the ground truth communities for testing are randomly chosen. The
randomness of the batch tests is intended to eliminate statistical bias in our tests.


4 Local Expansion via Minimizing One Norm

4.1 Algorithm Overview 

Spectral clustering makes use of a small number of singular vectors proportional to the 
number of communities in the network. If a graph has thousands of small communities, 
it is impractical to calculate a number of singular vectors greater than the number of 
communities. We are experimenting with a fundamentally new technique, which does 
not require the burdensome computation of a large number of singular vectors. Before 
explaining our local spectral approach for finding overlapping communities, it is necessary 
to make clear what we mean by local spectra. 

In traditional spectral clustering methods, one finds the first few singular vectors of
the Laplacian matrix of a graph $G$ with $n$ vertices. Suppose the first $d$ singular vectors
are obtained; one can then form an $n \times d$ matrix as a latent space. Then one associates with
each vertex a point in this latent space whose coordinates are given by the entries of the
corresponding row in the matrix. Vertices are clustered using some method such as the k-means
clustering algorithm. This method is not likely to work well if the communities are
small and heavily overlapping with each other.

Each local expansion process from a seed set can be viewed as a test case.

In the literature, several different definitions of the graph Laplacian exist. Readers can refer to [20] for
more details, which serves as a good introductory paper on spectral clustering.

We make two fundamental changes to this method. The first modification overcomes
the drawback of computing the singular vectors. Intuitively, the vertices around the seed
members are more likely to be in the target community, thus a random walk serves as a
natural subroutine to reveal these potential members.

We start a random walk from several known members in the target community and
run it for a few steps. The number of random walk steps should be large enough to reach
the vertices in the target community, but not so large that the walk spreads out to the entire
graph. Instead of considering a single probability vector, we consider the span of a few
dimensions of vectors after the short random walks and use it as the approximate invariant
subspace (local spectra). The second modification handles the overlapping situation. Instead of
using k-means to partition the points in the latent space into disjoint clusters, we look for
the minimum 0-norm vector in the span of the invariant subspace obtained above, such
that the seed members are in its support. We want to find rows in the invariant subspace
that point in nearly the same direction as the seed members. We use the minimum 1-norm vector as
a proxy for the minimum 0-norm vector, since finding the minimum 0-norm vector is an NP-hard
problem.

In the following, we give a formal description of our local spectral approach Lemon for
detecting target communities from a small seed set. Given as input a set of a few vertices
$S$ that are already known to be in the target ground truth community $C^*$, our algorithm
outputs the algorithmic community $C$ such that the $F_1$ measure scoring the similarity
between $C$ and $C^*$ is maximized. Note that each seed set expansion is performed
on a small sampled graph $G_s = (V_s, E_s)$, extracted from the neighborhood surrounding
the seed set $S$. The details of sampling the local graph are given in Section 4.4.

Step 1. Generate the local spectra: 

Consider the subgraph $G_s$ extracted from the neighborhood surrounding the
seed set $S$. We define the normalized adjacency matrix $\bar{\mathbf{A}}_s$ of the graph $G_s$ as

$$\bar{\mathbf{A}}_s = \mathbf{D}_s^{-1/2} (\mathbf{A}_s + \mathbf{I}) \mathbf{D}_s^{-1/2}, \tag{4}$$

where $\mathbf{A}_s$ and $\mathbf{D}_s$ denote the adjacency matrix and the diagonal degree matrix of $G_s$,
respectively, and $\mathbf{I}$ is the identity matrix. Consider a random walk starting from exemplary vertices in $S$. Let $\mathbf{p}_0$ denote
the initial probability vector where the total probability is evenly distributed among the
seed members. We describe how to efficiently construct the local spectra by iteratively
transforming the orthonormal basis starting with a Krylov subspace defined below.

Definition 1 The order-$(l+1)$ Krylov subspace generated by the matrix $\mathbf{A}$ and vector $\mathbf{p}_0$ is
the linear subspace spanned by the probability vectors in $l$ successive random walks:

$$\mathcal{K}_{l+1}(\mathbf{A}, \mathbf{p}_0) = \mathrm{span}\{\mathbf{p}_0, \mathbf{A}\mathbf{p}_0, \ldots, \mathbf{A}^l \mathbf{p}_0\}. \tag{5}$$

In Algorithm 1, we briefly summarize the procedure for calculating the local spectral
subspace from a specified seed set $S$. We start by calculating the initial invariant subspace
$\mathbf{V}_{0,l}$, which is the orthonormal basis of $\mathcal{K}_{l+1}(\bar{\mathbf{A}}_s, \mathbf{p}_0)$. The local spectral subspace can
then be obtained by iterating the process specified in Lines 4-6 of Algorithm 1. Figure 1
shows an example local spectral subspace generated from a synthetic graph with the
Erdős-Rényi $G(n, p)$ model.

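Algorithm 1 LocalSpectral($G_s$, $S$)

Input: subgraph $G_s$, subspace dimension $l$, and random walk step $k$
Output: local spectra $\mathbf{V}_{k,l}$
1: Compute the normalized adjacency matrix $\bar{\mathbf{A}}_s$ using Eq. (4)
2: Initialize $\mathbf{p}_0$
3: $\mathbf{V}_{0,l} = \mathrm{orth}(\mathcal{K}_{l+1}(\bar{\mathbf{A}}_s, \mathbf{p}_0))$
4: for $i = 1, \ldots, k$ do
5:   $\mathbf{V}_{i,l} \mathbf{R}_{i,l} = \bar{\mathbf{A}}_s \mathbf{V}_{i-1,l}$, where the factor $\mathbf{R}_{i,l}$ is obtained by QR factorization so that $\mathbf{V}_{i,l}$ is orthonormal.
6: end for
7: Return local spectra $\mathbf{V}_{k,l}$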
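The procedure of Algorithm 1 can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the normalized adjacency matrix is assumed to take the form $\mathbf{D}^{-1/2}(\mathbf{A} + \mathbf{I})\mathbf{D}^{-1/2}$, the orthonormal basis is obtained with a QR factorization rather than a dedicated `orth` routine, and the names `normalized_adjacency` and `local_spectra` are hypothetical. Following Definition 1 literally, the Krylov block has $l + 1$ columns.

```python
import numpy as np


def normalized_adjacency(A):
    """Assumed normalization: D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    if np.any(deg == 0):
        raise ValueError("isolated vertices are not supported in this sketch")
    inv_sqrt = 1.0 / np.sqrt(deg)
    return inv_sqrt[:, None] * (A + np.eye(A.shape[0])) * inv_sqrt[None, :]


def local_spectra(A, seeds, l=3, k=3):
    """Orthonormal basis of the subspace after k random walk steps."""
    n = A.shape[0]
    A_bar = normalized_adjacency(A)
    # Initial probability is spread evenly over the seed members.
    p0 = np.zeros(n)
    p0[list(seeds)] = 1.0 / len(seeds)
    # Krylov block [p0, A p0, ..., A^l p0] as in Definition 1.
    columns = [p0]
    for _ in range(l):
        columns.append(A_bar @ columns[-1])
    V, _ = np.linalg.qr(np.column_stack(columns))
    # Lines 4-6: apply the walk to the basis, then re-orthonormalize by QR.
    for _ in range(k):
        V, _ = np.linalg.qr(A_bar @ V)
    return V
```

Re-orthonormalizing after every step keeps the basis numerically well conditioned; without it the columns would all drift toward the dominant eigenvector.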


Step 2. Seek a sparse vector

With the local spectra $\mathbf{V}_{k,l}$, we solve the following linear programming problem,

$$\begin{aligned} \min \quad & \|\mathbf{y}\|_1 \\ \text{s.t.} \quad & \mathbf{y} \in \mathrm{span}(\mathbf{V}_{k,l}), \\ & \mathbf{y} \geq 0, \\ & \mathbf{y}(S) \geq 1, \end{aligned}$$

where the first constraint indicates that $\mathbf{y}$ is in the space of $\mathbf{V}_{k,l}$. The elements of $\mathbf{y}$
indicate the likelihood that the corresponding vertices belong to the target community, which
is non-negative. The third constraint enforces that the seeds are in the support of the sparse vector
$\mathbf{y}$.

After sorting the elements of $\mathbf{y}$ in non-ascending order to obtain a vector $\hat{\mathbf{y}}$, the vertices
corresponding to the top $|C|$ elements in $\hat{\mathbf{y}}$ are returned as the detected community
with respect to the seed set $S$.
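One way to pose the linear program of Step 2 for an off-the-shelf solver is to substitute $\mathbf{y} = \mathbf{V}\mathbf{x}$ with a free coefficient vector $\mathbf{x}$. Because $\mathbf{y} \geq 0$, the 1-norm of $\mathbf{y}$ equals the sum of its entries, so the objective becomes linear in $\mathbf{x}$. The sketch below uses `scipy.optimize.linprog`; the function names `sparse_indicator` and `top_vertices` are hypothetical, and the toy basis in the usage lines is chosen only so that the result is easy to check by hand.

```python
import numpy as np
from scipy.optimize import linprog


def sparse_indicator(V, seeds):
    """Solve min ||y||_1 s.t. y = V x, y >= 0, y[seeds] >= 1, and return y."""
    n, d = V.shape
    c = V.sum(axis=0)            # sum of entries of V x, equal to ||y||_1 when y >= 0
    A_ub = -V                    # -V x <= b_ub encodes V x >= -b_ub
    b_ub = np.zeros(n)           # y >= 0 for every vertex
    b_ub[list(seeds)] = -1.0     # y >= 1 on the seeds
    result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
    if not result.success:
        raise ValueError(result.message)
    return V @ result.x


def top_vertices(y, size):
    """Indices of the `size` largest entries of y, in non-ascending order."""
    return np.argsort(-y)[:size].tolist()


# Toy basis: two normalized block indicators over six vertices.
V_toy = np.zeros((6, 2))
V_toy[:3, 0] = 1.0 / np.sqrt(3)
V_toy[3:, 1] = 1.0 / np.sqrt(3)
y_toy = sparse_indicator(V_toy, seeds=[0])
community_toy = top_vertices(y_toy, 3)
```

With the seed in the first block, the minimizer puts weight only on that block, so the three largest entries of `y_toy` are the vertices 0, 1 and 2.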

In the experiments on real datasets, we fix the walk step $k$ and dimension $l$ to be 3 and 3, respectively.
For LFR benchmark datasets, we adopt altogether 6 combinations for the (step, dimension) tuple:
(2, 3), (2, 4), (2, 5), (3, 3), (3, 4), (3, 5), and the highest $F_1$ score among these combinations is returned.

Figure 1: An example of local spectral subspace. The synthetic subgraph $G_s$ is
generated with the Erdős-Rényi $G(n, p)$ model with background noise $p = 0.05$. The spammer
groups A and B (denoted by blue and pink, respectively) are of size 100 with edge probability
$p = 0.9$, and partially overlap in 20 nodes. The non-spammer group C (denoted by
green) has size 320 with $p = 0.2$. The subspace is generated by Algorithm 1 starting
from the seed with index 10 in the spammer group A.

Step 3. Reseeding

Augment the initial seed set by adding the vertices corresponding to the top $t$ elements
of $\mathbf{y}$. Denote the augmented seed set as $S'$. Then repeat Step 1 and Step 2 using the
augmented seed set $S'$. The detection accuracy can be improved through iterations by
increasing $t$ by a constant number $s$ each time. We define $s$ to be the seed expansion
step, which is used as a tunable parameter for adjusting the convergence rate. Usually,
a larger expansion step results in lower performance but a faster running speed
with fewer iterations. In the experiments, we fix the seed expansion step to be 6 for both
synthetic and real datasets. The number of iterations for the seed expansion is determined
by the stopping criterion (Section 4.5).
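The reseeding loop can be sketched as below. Several details are assumptions of this example rather than statements from the text: the first round is taken to use $t = s$, a fixed iteration cap `max_iter` stands in for the stopping criterion of Section 4.5, and `score_fn` is a hypothetical callable that performs Step 1 and Step 2 for a given seed set and returns the vector of scores.

```python
def augment_seeds(seeds, y, t):
    """Add the vertices with the t largest entries of y to the seed set."""
    ranked = sorted(range(len(y)), key=lambda v: y[v], reverse=True)
    return set(seeds) | set(ranked[:t])


def reseed_loop(initial_seeds, score_fn, s=6, max_iter=5):
    """Repeat Steps 1-2 with an augmented seed set, growing t by s each round.

    score_fn(seeds) is a hypothetical stand-in for Steps 1 and 2.
    max_iter replaces the stopping criterion, which is defined elsewhere.
    """
    seeds = set(initial_seeds)
    t = s  # assumed starting value
    for _ in range(max_iter):
        y = score_fn(seeds)
        # The initial seed set is augmented with the current top-t vertices.
        seeds = augment_seeds(initial_seeds, y, t)
        t += s
    return seeds
```

A smaller `s` grows the seed set more cautiously at the cost of more rounds, which mirrors the trade-off described above.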


4.2 Parameter Sensitivity 

The random walk step $k$ and subspace dimension $l$ are the key parameters in the local spectral
clustering algorithm. We conduct a parameter sensitivity study for these two parameters
on the four real datasets.

4.2.1 Subspace dimension

To study the subspace dimension parameter, we fix the random walk step to be 3 and
vary the dimension $l$ from 1 to 15. Figure 2 (left panel) shows that changing
the dimension $l$ does not cause significant fluctuation in the performance. On one hand,
choosing a large dimension $l$ is undesirable because it would increase the computation cost
in the step of generating local spectra. On the other hand, when the dimension degrades to
$l = 1$, the standard deviation of the $F_1$ score becomes significant, making the detection accuracy
unstable. In this paper, we fix $l = 3$ because the experiment suggests that setting $l = 3$ can
statistically achieve both high and stable performance. Note that this observation holds
not only for the Amazon network, but for the remaining real datasets as well.

4.2.2 Random walk step

To investigate how the random walk step affects the algorithm performance, we fix the
dimension $l$ to be 3 and vary the random walk step $k$ from 1 to 15. Figure 2 (right
panel) shows that the average $F_1$ score plateaus as $k$ increases, and a 3-step random walk
can yield the algorithm's full potential. The standard deviation, however, increases significantly
when $k$ exceeds 10. This indicates that a longer random walk is undesirable for
stably uncovering the local community structure. Throughout the paper, we fix the random
walk step $k = 3$ for the real datasets.

[Figure 2: two panels, with x-axes "dimension" (left) and "random walk step" (right).]

Figure 2: The average $F_1$ score on the Amazon network with varying dimension $l$ and random
walk step $k$, respectively. The plots depict the statistical regression line with a 95%
confidence interval.


For LFR benchmark graphs, we adopt altogether 6 combinations for the (step, dimension) tuple:
(2, 3), (2, 4), (2, 5), (3, 3), (3, 4), (3, 5), and return the highest $F_1$ score among these combinations.
[Figure 3: two panels, both titled "LFR (mu=0.1, om=3)", with x-axes "dimension" (left) and "random walk step" (right).]

Figure 3: The average $F_1$ score on the LFR benchmark graph ($\mu = 0.3$) with varying dimension
$l$ and random walk step $k$, respectively. The plots depict the statistical regression
line with a 95% confidence interval.


4.3 Local Spectra vs. PageRank 

The local spectral clustering approach and the PageRank algorithm both utilize short random
walks to detect the local community structure. PageRank is based solely on a single
probability vector, and the latent community members are selected by ranking the
probability values among vertices. Local spectral clustering advances PageRank-like
algorithms by forming a subspace based on the short random walk and seeking a
sparse vector such that the seeds are in its support.



            Amazon    DBLP     YouTube    Orkut
LEMON       0.953     0.665    0.240      0.202
PageRank    0.140     0.115    0.136      0.044

Table 4: Comparison of the mean $F_1$ score with local spectral clustering and PageRank.
(LEMON' uses the ground truth size to decide the community size, and LEMON
determines the community size automatically by the first local minimum of the conductance
value.)

By comparing the performance of these two approaches on the real datasets, we show
that seeking the sparse vector is more effective than directly sorting the probability
vector alone. Table 4 shows the comparison of the average $F_1$ score obtained by local spectral
clustering and PageRank, respectively. From the result, we see that the performance gain
brought by the local spectral method is significant, where it achieves more than 5 times
higher accuracy on Amazon, DBLP and Orkut networks. We also consider a
variety of state-of-the-art community detection algorithms for performance comparison
later in the paper.

The statistical results of the PageRank algorithm are sourced from the literature.



Figure 4: Comparison of the average $F_1$ score with local spectral clustering and PageRank
algorithms. The height of the bars shows the mean, with the 95% confidence interval.


4.4 Complexity Reduction by Sampling Method 

If one wants to uncover a small community within a large network consisting of billions of 
vertices, it would be very costly to take all the vertices into account. We want to discover 
the target community accurately while keeping the number of vertices examined small. 
A sampling method can effectively solve the memory consumption issue when one wants to
find a local community within a large graph, since the whole graph does not have to be
stored in memory.

In practice, the unknown members in the target community are more likely to be around
the seed members, and are usually a few steps away from the seeds. This observation
motivates us to reduce the complexity by taking only a portion of the graph into consideration.
Ideally, this partial graph should contain as many vertices in the target community as
possible, and maintain a small size of the same scale as that of the target community.

To sample the graph, we expand the seed set using a random walk. After a few steps of the
random walk, vertices with large probability are more likely to be in the target community,
while vertices with a small probability of being reached are treated as redundant. If
the target community exists for the seed set, then according to [1], this target community
would serve as a bottleneck for the probability to spread out. It is worth noting
that other expansion methods such as breadth-first search (BFS) would entirely ignore
the bottleneck defining the community and rapidly mix with the entire graph before a
significant fraction of vertices in the community have been reached. The subgraph returned
by BFS usually contains fewer vertices in the target community than a subgraph of the
same size obtained by the random walk technique.

Figure 5: Comparison of the average F1 score with ground truth and automatic size
determination. The left corresponds to the LFR datasets when $\mu = 0.3$ and the right corresponds
to the real datasets.

In the experiments on real datasets, we conduct a random walk starting from the seed
set until the probability has been spread out to $a \cdot |C|_{avg}$ vertices, where $a$ is some constant
and $|C|_{avg}$ is the average community size in the graph. Note that $a \cdot |C|_{avg}$ should be large
enough to be able to cover as many vertices in the ground truth community as possible.
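
The sampling step described above can be sketched as follows. This is a minimal illustration
under stated assumptions, not the authors' implementation: the function name `sample_subgraph`,
the adjacency-dictionary graph format, the lazy walk (half of the mass stays in place at each
step) and the `max_steps` cap are all hypothetical choices made for the example.

```python
from collections import defaultdict


def sample_subgraph(adj, seeds, target_size, max_steps=10):
    """Return a vertex set obtained by spreading random-walk mass from the seeds.

    adj: dict mapping each vertex to a list of its neighbours (hypothetical format).
    target_size: number of vertices to keep, e.g. a * |C|_avg in the text.
    """
    seeds = list(seeds)
    # Start with the probability mass split evenly over the seeds.
    prob = {s: 1.0 / len(seeds) for s in seeds}
    steps = 0
    # Walk until the mass has reached enough vertices (or the step cap is hit).
    while len(prob) < target_size and steps < max_steps:
        nxt = defaultdict(float)
        for v, p in prob.items():
            if not adj[v]:
                nxt[v] += p  # isolated vertex: the mass cannot move
                continue
            nxt[v] += p / 2.0  # lazy walk (assumption): half of the mass stays
            share = p / 2.0 / len(adj[v])
            for u in adj[v]:
                nxt[u] += share
        prob = dict(nxt)
        steps += 1
    # Keep the vertices holding the most probability; seeds are always kept.
    ranked = sorted(prob, key=prob.get, reverse=True)
    return set(ranked[:target_size]) | set(seeds)
```

Only the vertices reached by the walk are ever touched, which is why the whole graph does not
need to be held in memory; a real implementation would load adjacency lists on demand.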

This newly obtained subgraph will be used for the remaining computation. The
complexity of our algorithm now depends on the size of the subgraph after sampling, which is
$O(|C|_{avg}^{r})$ for some small constant $r$.


Dataset    Coverage ratio   Sample rate   |C|avg   Subgraph size
Amazon     1.00             0.0087        39       2913
DBLP       0.98             0.0076        251      2409
YouTube    0.66             0.0033        79       3745
Orkut      0.64             0.0011        83       3379


Table 5: Statistics of the mean values for the sampling method on real datasets. 

Table 5 gives the statistics after applying the sampling method to the real networks.
For example, in the DBLP network, setting $a$ to be around 10 would yield a subgraph
containing on average 98% of the vertices in the ground truth community. After sampling, we
only need to deal with a subgraph of size around 2400 instead of 317,080, bringing a
significant reduction of both time and space complexity.

Footnote: A fast implementation method for updating the probability vector of the random
walks is described in detail in [4], Section 4.

4.5 Round Diffusion Vector via Sweeping Cut

If ground truth communities are available, the above algorithm is guaranteed to stop
within a few iterations, since the seed set will no longer be augmented once its size exceeds that
of the ground truth community. The algorithm then returns the community found
with the highest F1 score during the iterations as the result. In real cases, however, we
do not know the exact size of the communities, which makes it ambiguous for most locally based
detection algorithms to decide when to stop expanding so that
the discovered community is a “good” community. It is thus important to solve two
issues: 1) how to automatically determine the size of the community given a seed set $S$,
and 2) when to stop growing the seed set during the reseeding process.


4.5.1 Determine the size of the community 

It has already been shown that random walks produce communities with conductance
guarantees and ensure a small boundary defining a natural community in locally based
detection algorithms [1]. The intuition is that adding irrelevant vertices to the target
community would inevitably cause the conductance to increase, and finding a low-conductance
community could ensure the closeness between the detected members and the known seed
set.

A commonly adopted method of rounding the diffusion values into labels is to perform
a sweep-cut procedure on the nodes ranked by the diffusion value, with the objective of
minimizing a graph cut metric such as conductance [3, 17, 25]. As we will see, the local
conductance for a small group of vertices in the graph contains valuable information and
enables us to design effective stopping criteria for our algorithm. We define conductance
using the generalized Rayleigh quotient specified below:


Definition 2 Let $x \in \{0,1\}^{N}$ denote the binary indicator vector for the subset $V \subseteq V_s$ and
let $H \in \mathbb{R}^{N \times N}$ be any symmetric matrix. The Rayleigh quotient with respect to $H$ is expressed
as the quadratic form

$$R_H(x) = \frac{x^T H x}{x^T x}. \qquad (6)$$

In particular, the conductance of the set $V$ measures the fraction of edges leaving $V$ among all
the edges incident on $V$, and can be expressed using a generalized Rayleigh quotient

$$\Phi(V) = R_{L_s, D_s}(x) = \frac{x^T L_s x}{x^T D_s x} = \frac{x^T (D_s - A_s) x}{x^T D_s x}, \qquad (7)$$

where $L_s = D_s - A_s$ is the Laplacian matrix of graph $G_s$.
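
For a binary indicator vector, the two quadratic forms in the generalized Rayleigh quotient
reduce to simple counts: the numerator is the number of edges with exactly one endpoint in
the subset, and the denominator is the sum of the degrees inside the subset. The sketch below
evaluates conductance this way. The function name `conductance` and the adjacency-dictionary
format are assumptions for illustration, and the graph is assumed to be simple and undirected.

```python
def conductance(adj, subset):
    """Conductance of `subset` as the generalized Rayleigh quotient x^T(D-A)x / x^T D x."""
    inside = set(subset)
    # x^T D x: sum of the degrees of the vertices in the subset.
    volume = sum(len(adj[v]) for v in inside)
    if volume == 0:
        raise ValueError("conductance is undefined for a subset with no incident edges")
    # x^T (D - A) x: number of edges with exactly one endpoint in the subset.
    cut = sum(1 for v in inside for u in adj[v] if u not in inside)
    return cut / volume
```

On two triangles joined by a single bridge edge, one triangle has a cut of 1 and a volume of
7, so its conductance is 1/7.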

Now suppose we have a rough estimation of the lower and upper bound for the size 
of communities in a graph, which we denote by |C|min and |C|max respectively. We could 
modify the original algorithm in the following way. 

At step 2, after obtaining the sorted sparse vector $y$, we hope to truncate the
sorted vector at some element such that all the vertices corresponding to elements no
smaller than it are included in the algorithmic community. The crux is that we do not
know the best position at which to truncate the vector $y$. To solve this issue, we denote by
$A_i$ the set of vertices corresponding to the top $i$ elements in $y$. We then sweep over the
sets from $A_{|C|_{min}}$ to $A_{|C|_{max}}$ and calculate the corresponding conductance for each of the
sets. In practice, the conductance as a function of the set size usually
changes non-monotonically: it decreases first and then increases. We
then adopt the set size at which the minimum conductance on this curve is attained as the
estimated size of the community with respect to the seed set $S$.
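
The sweep described above can be sketched as below. This is an illustrative sketch rather than
the authors' code: `sweep_cut`, `ranked`, `min_size` and `max_size` are hypothetical names, the
graph is an adjacency dictionary of a simple undirected graph, and `ranked` is assumed to list
the vertices in decreasing order of their entries in the sparse vector.

```python
def sweep_cut(adj, ranked, min_size, max_size):
    """Return (size, conductance) of the best prefix of `ranked`, or None.

    Only prefixes whose size lies in [min_size, max_size] are candidates.
    """
    inside = set()
    volume = 0  # sum of degrees in the current prefix
    cut = 0     # edges with exactly one endpoint in the current prefix
    best = None
    for size, v in enumerate(ranked[:max_size], start=1):
        inner = sum(1 for u in adj[v] if u in inside)
        # Edges from v into the prefix stop being cut edges; the others become cut edges.
        cut += len(adj[v]) - 2 * inner
        volume += len(adj[v])
        inside.add(v)
        if size >= min_size and volume > 0:
            phi = cut / volume
            if best is None or phi < best[1]:
                best = (size, phi)
    return best
```

Updating the cut and the volume incrementally keeps the whole sweep linear in the number of
edges touched, instead of recomputing the conductance from scratch for every prefix.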

4.5.2 Stop the reseeding process 

As we keep augmenting the seed set through reseeding at step 3, a different seed set
would result in a different sparse vector $y$ and thus lead to potentially different algorithmic
communities. In practice, one of the seed sets produced during the augmenting process
achieves the highest F1 score. It remains to decide when to stop growing
the seed set so that the community found most closely resembles the ground truth
community. This issue can be solved in a similar fashion to that of determining the community
size. Specifically, we keep track of the minimum conductance found by the sweep for each seed
set during the expansion, and stop growing the seed set when this value reaches a local
minimum and starts to increase for the first time.
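
A minimal sketch of this stopping rule is given below. It assumes that each reseeding round
yields a pair of the community found and its conductance; the function name
`reseed_until_rise` and this pairing are hypothetical, chosen only to illustrate the rule.

```python
def reseed_until_rise(rounds):
    """Consume (community, conductance) pairs lazily and stop at the first increase.

    Returns the last pair seen before the conductance rose, or None if `rounds` is empty.
    """
    best = None
    for community, phi in rounds:
        if best is not None and phi > best[1]:
            break  # the conductance started to increase: keep the previous round
        best = (community, phi)
    return best
```

Because `rounds` can be a generator, later (and more expensive) reseeding rounds are never
computed once the first local minimum has been passed.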

4.5.3 Auto detect size vs ground truth size 

To verify our method, we compare the performance after applying the stopping criteria with
that obtained using ground truth communities. Figure 5 shows the statistical results of the
F1 score on both synthetic and real datasets. On both
datasets, the F1 score with automatic size determination is lower by only 10% on average
than the performance with available ground truth. This implies that our method
is applicable for finding communities that closely resemble the ground truth communities
on both synthetic and real datasets in different domains. It also suggests that our method
can be applied in practice to uncover natural communities when no ground
truth is available.

4.6 Bounding the Performance

In the following discussion, we bound the measure of “tightness” of the extracted
community with respect to the subgraph $G_s$ by relating spectral properties to Rayleigh quotients.
We start by providing several theorems and lemmas that will be used for deriving the
bound of conductance.

Theorem 1 (Cheeger’s Inequality) Let $\lambda_2$ be the second smallest eigenvalue of the Laplacian
matrix for a graph $G_s$. Then $\phi(G_s) \geq \lambda_2 / 2$, where $\phi(G_s) = \min_{V \subset V_s} \Phi(V)$.

There are many known proofs of this theorem [5], and we hence omit the details here.

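Lemma 1 The generalized Rayleigh quotient $R_{L_s, D_s}(x)$ is equivalent to the form of $R_{\mathcal{L}_s}(D_s^{1/2} x)$,
where $\mathcal{L}_s = I - D_s^{-1/2} A_s D_s^{-1/2}$ is the normalized Laplacian matrix of graph $G_s$.

Proof: By the definitions in Equations 6 and 7, we have

$$R_{\mathcal{L}_s}(D_s^{1/2} x) = \frac{x^T D_s^{1/2} \mathcal{L}_s D_s^{1/2} x}{x^T D_s x} = \frac{x^T (D_s - A_s) x}{x^T D_s x} = R_{L_s, D_s}(x).$$

□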

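Theorem 2 (Courant-Fischer Theorem) Let $\mathcal{X}^k$ denote a $k$-dimensional subspace of $\mathbb{R}^N$,
and let $x \perp \mathcal{X}^k$ represent that $x \perp y$ for all $y \in \mathcal{X}^k$. For any symmetric matrix $H \in \mathbb{R}^{N \times N}$
with eigenvalues $\lambda_1^{(H)} \leq \lambda_2^{(H)} \leq \dots \leq \lambda_N^{(H)}$,

$$\lambda_i^{(H)} = \min_{\mathcal{X}^{N-i}} \left( \max_{x \perp \mathcal{X}^{N-i},\, x \neq 0} R_H(x) \right) = \max_{\mathcal{X}^{i-1}} \left( \min_{x \perp \mathcal{X}^{i-1},\, x \neq 0} R_H(x) \right). \qquad (8)$$

We will not include the proof of the Courant-Fischer Theorem here. The interested reader
can find a proof in any major linear algebra textbook.

We denote $D_s^{1/2} x = z$. With the Courant-Fischer Theorem, we can express the
eigenvalues of the normalized Laplacian matrix as follows:

$$\begin{aligned}
\lambda_i^{(\mathcal{L}_s)} &= \min_{\mathcal{Z}^{N-i}} \left( \max_{z \perp \mathcal{Z}^{N-i},\, z \neq 0} R_{\mathcal{L}_s}(z) \right) \\
&= \min_{\mathcal{Z}^{N-i}} \left( \max_{z \perp \mathcal{Z}^{N-i},\, z \neq 0} \frac{x^T (D_s - A_s) x}{x^T D_s x} \right) \\
&= \min_{\mathcal{X}^{N-i}} \left( \max_{x \perp \mathcal{X}^{N-i},\, x \neq 0} \frac{x^T (D_s - A_s) x}{x^T D_s x} \right) \\
&= \min_{\mathcal{X}^{N-i}} \left( \max_{x \perp \mathcal{X}^{N-i},\, x \neq 0} \frac{\sum_{u \sim v} (x_u - x_v)^2}{\sum_{u} x_u^2 d_u} \right) \\
&\leq 2,
\end{aligned}$$

where $u \sim v$ indicates that $u$ is adjacent to $v$. Similarly,

$$\lambda_i^{(\mathcal{L}_s)} = \max_{\mathcal{X}^{i-1}} \left( \min_{x \perp \mathcal{X}^{i-1},\, x \neq 0} \frac{\sum_{u \sim v} (x_u - x_v)^2}{\sum_{u} x_u^2 d_u} \right) \leq 2. \qquad (9)$$

The bound of 2 follows from $(x_u - x_v)^2 \leq 2 (x_u^2 + x_v^2)$: summing over all edges, each
vertex $u$ contributes $2 x_u^2$ once per incident edge, so the numerator is at most $2 \sum_u x_u^2 d_u$.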


Corollary 1 Given a graph $G_s$ with $N$ nodes, the largest eigenvalue of its normalized
Laplacian matrix is no bigger than 2, i.e., $\lambda_N^{(\mathcal{L}_s)} \leq 2$.
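
The bound of 2 can be checked numerically on the quotient that appears in the derivation
above, namely the sum of squared differences over the edges divided by the degree-weighted
sum of squares. The sketch below is only an illustration: the name `generalized_rayleigh`, the
adjacency-dictionary format and the dictionary `x` of real values per vertex are assumptions,
and vertices are assumed to be comparable (for example integers) so that each undirected edge
is counted once.

```python
def generalized_rayleigh(adj, x):
    """Evaluate sum_{u~v} (x_u - x_v)^2 / sum_u x_u^2 d_u for a nonzero real vector x."""
    numerator = 0.0
    for u in adj:
        for v in adj[u]:
            if u < v:  # count each undirected edge once
                numerator += (x[u] - x[v]) ** 2
    denominator = sum(x[u] ** 2 * len(adj[u]) for u in adj)
    return numerator / denominator
```

On a single edge with values 1 and -1 at its endpoints the quotient is 4 / 2 = 2, so the bound
is attained; for any other nonzero vector on any graph without isolated vertices the value
stays between 0 and 2.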

For any symmetric matrix $H \in \mathbb{R}^{N \times N}$ with orthonormal eigenvectors $q_1, q_2, \dots, q_N$
and corresponding eigenvalues $\lambda_1^{(H)} \leq \lambda_2^{(H)} \leq \dots \leq \lambda_N^{(H)}$, we can always decompose the
binary indicator vector $x$ into a linear combination of the eigenvectors, i.e., $x = \sum_i a_i q_i$.
This allows us to write

$$R_H(x) = \frac{x^T H x}{x^T x} = \frac{\left(\sum_i a_i q_i\right)^T H \left(\sum_i a_i q_i\right)}{\left(\sum_i a_i q_i\right)^T \left(\sum_i a_i q_i\right)} = \frac{\sum_i a_i^2 \lambda_i^{(H)}}{\sum_i a_i^2} = \sum_i w_i \lambda_i^{(H)}, \qquad (10)$$

where $w_i = a_i^2 / \|x\|^2$. Hence, the Rayleigh quotient can be viewed as a weighted average of
the eigenvalues. If the indicator vector $x$ forms an acute angle with the invariant subspace
associated with the extreme eigenvalues, then most of the weight in the average must be
on eigenvalues close to $\lambda_N^{(H)}$. Similarly, $R_H(x)$ can be bounded from below by the smallest
eigenvalues of $H$, in which case the indicator vector $x$ can be approximated by a linear
combination of the eigenvectors associated with the smallest eigenvalues.
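
As a small worked instance of the weighted-average view, consider the Laplacian of a path on
three vertices and the indicator vector of its first vertex. The example is ours, chosen only
because its eigenpairs can be verified by hand.

```latex
H = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix}, \qquad
x = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \qquad
R_H(x) = \frac{x^T H x}{x^T x} = \frac{1}{1} = 1.

% Eigenpairs of H (each satisfies H q_i = \lambda_i q_i):
\lambda_1 = 0,\; q_1 = \tfrac{1}{\sqrt{3}}(1, 1, 1)^T; \qquad
\lambda_2 = 1,\; q_2 = \tfrac{1}{\sqrt{2}}(1, 0, -1)^T; \qquad
\lambda_3 = 3,\; q_3 = \tfrac{1}{\sqrt{6}}(1, -2, 1)^T.

% Coefficients a_i = q_i^T x and weights w_i = a_i^2 / \|x\|^2:
a = \left( \tfrac{1}{\sqrt{3}},\; \tfrac{1}{\sqrt{2}},\; \tfrac{1}{\sqrt{6}} \right), \qquad
w = \left( \tfrac{1}{3},\; \tfrac{1}{2},\; \tfrac{1}{6} \right), \qquad
\sum_i w_i = 1.

% Weighted average of the eigenvalues:
\sum_i w_i \lambda_i = \tfrac{1}{3} \cdot 0 + \tfrac{1}{2} \cdot 1 + \tfrac{1}{6} \cdot 3 = 1 = R_H(x).
```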

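Lemma 2 Let $x \in \{0,1\}^{N}$ denote the binary indicator vector for the detected community
$C \subseteq V_s$ corresponding to the seed set $S$. The conductance of $C$ is bounded by

$$\lambda_2 / 2 \leq \Phi(C) \leq \min\{1,\; 2(1 - w_1)\}, \qquad (11)$$

where $\lambda_2$ is the second smallest eigenvalue of the Laplacian matrix of $G_s$, and $w_1$ is the weight
of the smallest eigenvalue of the normalized Laplacian matrix $\mathcal{L}_s$, as specified in Equation 10.

Proof: The left side inequality holds due to the fact that $\Phi(C) \geq \phi(G_s)$. Following
Cheeger’s inequality that $\phi(G_s) \geq \lambda_2 / 2$ in Theorem 1, we therefore have $\lambda_2 / 2 \leq \Phi(C)$. To
prove the right side, we first express the conductance $\Phi(C)$ using the Rayleigh quotient,

$$\Phi(C) = R_{L_s, D_s}(x), \qquad (12)$$

which can be further rewritten as $R_{\mathcal{L}_s}(D_s^{1/2} x)$ according to Lemma 1. Using a similar
decomposition to that in Equation 10, we can express $R_{\mathcal{L}_s}(D_s^{1/2} x)$ as the weighted average
of the eigenvalues $\lambda_1^{(\mathcal{L}_s)} \leq \dots \leq \lambda_N^{(\mathcal{L}_s)}$, i.e.,

$$R_{\mathcal{L}_s}(D_s^{1/2} x) = \sum_i w_i \lambda_i^{(\mathcal{L}_s)} \leq 2(1 - w_1).$$

The last step uses that the smallest eigenvalue of the normalized Laplacian is 0, that every
other eigenvalue is at most 2 (Corollary 1), and that the weights sum to 1. The bound
$\Phi(C) \leq 1$ holds because the conductance is a fraction of edges.

□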
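
To make the quantities in Lemma 2 concrete, the following hand-checkable example evaluates
them on the smallest nontrivial graph, a single edge. The example is ours and is meant only
as a sanity check of the notation.

```latex
% G_s: two vertices v_1, v_2 joined by one edge; detected community C = \{v_1\}.
D_s = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
A_s = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
\mathcal{L}_s = I - D_s^{-1/2} A_s D_s^{-1/2} = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}.

% Eigenpairs of \mathcal{L}_s:
\lambda_1 = 0,\; q_1 = \tfrac{1}{\sqrt{2}}(1, 1)^T; \qquad
\lambda_2 = 2,\; q_2 = \tfrac{1}{\sqrt{2}}(1, -1)^T.

% Indicator vector, conductance and weight of the smallest eigenvalue:
x = (1, 0)^T, \quad z = D_s^{1/2} x = (1, 0)^T, \quad
\Phi(C) = \frac{x^T (D_s - A_s) x}{x^T D_s x} = \frac{1}{1} = 1, \quad
w_1 = \frac{(q_1^T z)^2}{\|z\|^2} = \frac{1}{2}.

% Upper bound of Lemma 2:
\min\{1,\; 2(1 - w_1)\} = \min\{1, 1\} = 1 = \Phi(C).
```

In this example the upper bound is attained. Since $D_s - A_s$ coincides with $\mathcal{L}_s$ here,
$\lambda_2 = 2$ and the lower bound $\lambda_2 / 2 = 1$ is attained as well.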


5 Seeding 

Since the initial seed set serves as a key component in our algorithm for uncovering the
target community $C$, it is crucial to consider how the quality of the seed set affects the
performance. In practice, there is not much control over how the seeds are selected.
However, alternative seeding methods can be strategically applied by domain experts in
different scenarios based on the availability of candidate seeds. In this section, we will
focus on addressing two fundamentally important issues regarding the seed set: 1) What
defines “good” seeds? and 2) How many seeds are needed in order to uniquely define a
community?

5.1 Seeding Method 

To give a well-rounded evaluation, we consider five different seeding
methods in total. In this experiment, we adopt $|S| = 3$ seeds for each of the seeding methods
listed below.

1. High degree seeding: pick $|S|$ vertices with degree ranked in the top one third
among the degrees of all vertices in $C$.

2. Low degree seeding: pick $|S|$ vertices with degree ranked in the bottom one third
among the degrees of all vertices in $C$.

3. Triangle seeding: pick $|S|$ vertices in $C$ that form a triangle as the initial seed set.

4. Random seeding: pick $|S|$ vertices in $C$ randomly.

5. High inward-edge ratio seeding: the inward-edge ratio of a vertex $v$ is defined
as the fraction of links connecting to another vertex inside the target community $C$
among all the links coming out from $v$. We pick $|S|$ vertices with inward-edge ratio
ranked in the top one third among all vertices in $C$.

Figure 6: The average F1 score on LFR datasets ($\mu = 0.1$) with different seeding methods.
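
The five seeding methods listed above can be sketched as follows. The function names
(`pick_seeds`, `top_third`, `find_triangle`), the method labels and the adjacency-dictionary
format are hypothetical; picking the seeds uniformly at random from the qualifying third is
also an assumption, since the text does not say how ties within the third are resolved.

```python
import random


def top_third(vertices, score):
    """Vertices whose score ranks in the top one third (at least one vertex)."""
    ranked = sorted(vertices, key=score, reverse=True)
    return ranked[:max(1, len(ranked) // 3)]


def find_triangle(adj, members, inside):
    """First triple of mutually adjacent community members, or None."""
    for a in members:
        for b in adj[a]:
            if b not in inside:
                continue
            for c in adj[b]:
                if c != a and c in inside and c in adj[a]:
                    return [a, b, c]
    return None


def pick_seeds(adj, community, method, k=3, rng=random):
    """Pick up to k seeds from `community` with one of the five strategies."""
    members = sorted(community)
    inside = set(members)

    def degree(v):
        return len(adj[v])

    def inward_ratio(v):
        # Fraction of v's links that stay inside the community.
        return sum(1 for u in adj[v] if u in inside) / (len(adj[v]) or 1)

    if method == "triangle":
        return find_triangle(adj, members, inside)
    if method == "random":
        pool = members
    elif method == "high_degree":
        pool = top_third(members, degree)
    elif method == "low_degree":
        pool = top_third(members, lambda v: -degree(v))
    elif method == "high_inward_ratio":
        pool = top_third(members, inward_ratio)
    else:
        raise ValueError("unknown seeding method: " + method)
    return rng.sample(pool, min(k, len(pool)))
```

Note that the inward-edge ratio requires knowing the ground truth community, so this strategy
is only available in controlled experiments or when a domain expert can estimate it.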

Figure 6 gives the experimental results on LFR benchmark datasets with mixing
parameter $\mu = 0.1$. It is interesting to note that the high-degree seeding method consistently
achieves the highest F1 score in both groups of datasets. When $\mu = 0.1$, triangle seeding
leads to the worst performance, with a low F1 score and high standard deviation. This implies
that seeding from a compact core structure is less advantageous than seeding sporadically
among vertices. The intuitive explanation behind this phenomenon is that it is more
difficult for the probabilities to spread out when the random walk starts from a cohesive
structure.

Another interesting observation is that the high inward-edge ratio seeding method
consistently leads to the best performance among different seeding methods on both synthetic
and real datasets. In m, the authors have the same observation as ours but did not give
an explicit explanation of this phenomenon. In fact, when a large fraction of the seeds’
links connect to vertices within the same community, random walks starting from these
seeds are more likely to move probability to vertices within the community
rather than spreading it to vertices outside the community. A higher detection accuracy
can thus be achieved since the target community contains much of the probability after
short random walks.

Figure 7: The average F1 score on real datasets with different seeding methods.

Moreover, it is also striking to note the difference between the test results on synthetic
datasets and those on real datasets. Even though the high-degree seeding method
always yields higher performance than random seeding on synthetic datasets, the
behavior of these seeding methods on real networks is quite different. In Figure 7 we
see that low-degree seeds lead to better results than high-degree seeds on the DBLP
and YouTube datasets. The degree of seeds does not have a significant impact on the
performance in the Amazon and Orkut networks, since high-degree seeding
and low-degree seeding nearly tie on these datasets. In [11], the authors
compared the detection accuracy of a PageRank-based seed set expansion algorithm with
high-degree seeding and random seeding on real networks, and concluded that the random
seeding method always outperforms high-degree seeding in all domains of real networks.
However, we remark that this observation does not apply to our algorithm, as we find
that high-degree seeding works slightly better than random seeding on the Orkut and YouTube
datasets.

5.2 Seed Set Size


It is also interesting to investigate how the size of the seed set affects the performance of 
our algorithm. 

We first experiment on the LFR benchmark datasets with varying seed set size. We
choose a seed set of size proportional to the size of the target community $C$. Specifically,
we test with five different seeding ratios $r$: 2%, 4%, 6%, 8% and 10%, and
round $r \cdot |C|$ to an integer if it is a fraction. Figure 8 shows the F1 scores when $\mu = 0.1$.
The algorithm’s performance generally improves as the seed set size increases.
In the case when both the mixing parameter and overlapping membership are small, e.g.,
$\mu = 0.1$, $om = 2$, increasing the seed set size does not seem to affect the performance
significantly, and a seed set consisting of a small percentage of vertices is sufficient to
discover the target community with high accuracy. This implies that when the structure
of a small community is well-defined, our algorithm only needs 2 to 3 seeds to reveal the
remaining members in a community of size roughly 100. In general, we use an 8% fraction
of the vertices in the target community for the whole LFR datasets.


Figure 8: The average F1 score on LFR benchmark data with different seeding ratios (2%, 4%,
6%, 8% and 10%). The left figure corresponds to the datasets with mixing parameter $\mu = 0.1$
and the right one corresponds to $\mu = 0.3$.

We then carry out a similar experiment on the real datasets. The result on real
networks is interesting because increasing the seed set size has little effect on the
performance. In particular, using only 3 seeds yields almost the same performance as using an
8% fraction of the vertices in the target community as seeds on real datasets.

Our algorithm is thus advantageous to many other seed set expansion algorithms that 
usually require a higher fraction of vertices to be known. For example, in the authors 
perform a similar experiment on the DBLP network. Their algorithm
achieves a maximum recall of 0.3 when the seeding ratio is 10%, while Lemon achieves
an average F1 score of 0.66 with 3 vertices. This makes our algorithm practical for real
networks when it is impossible to collect a large number of seeds.


5.3 Further Extension 

As the results of using different seeding methods suggest, high-degree seeds can
heuristically lead to better results on synthetic data. Such a heuristic implies that a vertex with
higher degree may exert higher impact on shaping the subspace we are looking for, and
thus affect the performance by leading to different sparse vectors from which we obtain the
“candidates” of the target community.

In practice, we usually have little control over the seed set. We rarely get a seed
set of high-degree members. More often than not, the degrees of the seeds are randomly
distributed. We are therefore inspired to tailor our algorithm to
emphasize the seeds with high degree. The modification is rather straightforward: when
calculating the initial probability vector $p_0$ from which to start a random walk, instead of evenly
distributing the probability among the seeds, we initialize the probability vector
according to the degree of each seed. Formally,


$$p_0(v_i) = \begin{cases} d(v_i) / \mathrm{Vol}(S) & \text{if } v_i \in S \\ 0 & \text{otherwise} \end{cases} \qquad (13)$$

where $d(v_i)$ denotes the degree of vertex $v_i$ and $\mathrm{Vol}(S)$ is the sum of the degrees of the
seeds in $S$. In other words, we enforce a bias towards the
high-degree vertices at the beginning of the random walk. Note that each time after the
reseeding process, the initial probability vector also needs to be recalculated in the same
way.
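
A minimal sketch of the two initializations is given below. The function name
`initial_probability`, the `degree_weighted` flag and the adjacency-dictionary format are
assumptions for illustration; vertices outside the seed set are simply absent from the
returned dictionary, which stands for a probability of 0.

```python
def initial_probability(adj, seeds, degree_weighted=True):
    """Initial random-walk distribution over the seeds (all other vertices get 0)."""
    seeds = list(seeds)
    if degree_weighted:
        # Vol(S): sum of the degrees of the seeds.
        volume = sum(len(adj[s]) for s in seeds)
        return {s: len(adj[s]) / volume for s in seeds}
    # Uniform initialization: every seed gets the same share.
    return {s: 1.0 / len(seeds) for s in seeds}
```

With one seed of degree 5 and one of degree 1, the degree-weighted start puts 5/6 of the
probability on the first seed, compared with 1/2 under the uniform start.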

Figure 9 depicts the experimental results on LFR benchmark graphs with and without
degree-normalized initialization for the random walk. We find that degree
normalization of the initial probability vector results in better performance.

We then perform the same experiments on real networks, and find that degree normalization
would, on the contrary, lead to slightly worse statistical results (see Figure 10). The
completely different behavior of degree normalization on real datasets is rather
intriguing. In fact, this phenomenon accords with our previous observation in Section 5.1 that
a high-degree seed set is less advantageous than random seeds on real datasets. This
explains why emphasizing the high-degree vertices would worsen the performance on
real datasets.


Figure 9: Comparison of the average F1 score on LFR datasets with and without normalizing
the initial probability vector by each seed’s degree. (Horizontal axis: overlapping
membership, from 2 to 8; series: $\mu = 0.1$ and $\mu = 0.3$, normalized and unnormalized.)

Figure 10: Comparison of the average F1 score on real datasets with and without normalizing
the initial probability vector by each seed’s degree.

5.4 Enlarging the Initial Seed Set

In Section 5.2 we see that a larger seed set generally leads to better results on
synthetic datasets. But when there are not many seeds available, can we
still find a way to improve the performance on synthetic datasets? This can be achieved
by preprocessing the seed set before running our algorithm. Specifically, for each pair of
vertices $(v_i, v_j)$ in the seed set $S$, we search for the shortest path that connects $v_i$ and $v_j$,
and add the vertices on the path to the original seed set if the length of the shortest path
is less than 3. The intuition behind this idea is that any two seeds in the same community must
be related for some reason, and they connect with each other either via a direct link or via
some other intermediate vertices. In the latter case, those intermediate vertices bridging
the seeds are also likely to be in the target community because they serve as the relational
“relay” in order for the seeds to be in the same community.

Note that the procedure of enlarging the initial seed set $S$ differs from the
reseeding process that takes place while running the algorithm. The preprocessing is done before we feed
the seeds into the algorithm, and serves to increase the size of the initial
seed set.
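
The preprocessing step can be sketched as below. This is an illustrative sketch, not the
authors' code: the names `shortest_path` and `enlarge_seed_set` are hypothetical, and the
threshold is interpreted here as a path of at most `max_edges = 2` edges (that is, at most one
intermediate vertex), which is an assumption since the text does not say whether the path
length counts edges or vertices.

```python
from collections import deque
from itertools import combinations


def shortest_path(adj, src, dst):
    """Breadth-first search; returns the vertex list of one shortest path, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for u in adj[v]:
            if u not in parent:
                parent[u] = v
                queue.append(u)
    return None


def enlarge_seed_set(adj, seeds, max_edges=2):
    """Add the vertices on short paths between every pair of seeds."""
    enlarged = set(seeds)
    for a, b in combinations(sorted(seeds), 2):
        path = shortest_path(adj, a, b)
        # Keep only short paths; len(path) - 1 is the number of edges on the path.
        if path is not None and len(path) - 1 <= max_edges:
            enlarged.update(path)
    return enlarged
```

On a large graph the search should be cut off once its depth exceeds the threshold; the
unbounded search above is kept only for brevity.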


Figure 11: Comparison of the average F1 score with and without enlarging the initial seed
set on LFR benchmark datasets. (Horizontal axis: overlapping membership, from 2 to 8;
series: $\mu = 0.1$ and $\mu = 0.3$, enlarged and original.)

Figure 11 presents the experimental results on LFR benchmark data with and without
enlarging the initial seed set in advance. We can see that enlarging the seed set
statistically improves the performance. This method can help when not enough
seeds are available, especially when the seed set consists of only 3
or 4 vertices.


6 Comparison with the state-of-the-art algorithms 

To give a well-rounded performance comparison with state-of-the-art algorithms, we
compare our results with three localized community detection algorithms and four global
community detection algorithms.

6.1 Comparison with Localized Algorithms


We refer to the experimental results reported in some recent publications on localized
community detection algorithms [10, 11, 25]. Figure 12 illustrates the comparison of F1
scores on the Amazon, DBLP, YouTube and Orkut datasets. We use “LEMON-auto” to denote
the results obtained by applying the stopping criteria in Section 4.5. Since the results on the Orkut
and YouTube datasets are missing in [25] and [11], we use empty bars to indicate them.


Localized algorithms: We consider three locally based methods, Heat Kernel
(HK) [10], PageRank (PR) [11] and Seed Set Expansion (SSE) [25].

1. Heat Kernel [10]: The heat kernel (HK) is a type of graph diffusion for locally
identifying a community near a starting seed node. The algorithm can
deterministically find the community by computing the diffusion.

2. PageRank [11]: The personalized PageRank (PR) scheme is computed using the
power method and jumpback probability $\alpha = 0.10$ in [11].

3. Seed Set Expansion [25]: The seed set expansion (SSE) approach starts with a
phase of choosing a good seed set. The personalized PageRank scheme is then applied
to expand the seeds until a community with optimal conductance is found.


Figure 12 shows that Lemon achieves an F1 score of 0.910 on the Amazon dataset, far
outperforming the other algorithms. The average F1 score is about 3
times that of the heat kernel algorithm [10] on the Amazon, DBLP and Orkut networks.
Compared with [25], the average F1 score of our algorithm doubles their
best performance, achieved by the “spread hubs” method, on the Amazon dataset and triples
it on the DBLP network. The performance comparison of LEMON and
PageRank has been elaborated in Section 4.3. Also, note that in [11], the authors did not
have an explicit stopping criterion and instead assumed a budget for predicting the size of
the target community. We compare with the F1 score at a budget of 100 for both the Amazon
and DBLP datasets. From the results on the Amazon network in [11], we notice that even
when granted a budget of 400, which is far beyond the average community size of 39 in the Amazon
network, only a recall of 0.45 can be achieved. We infer that the F1 score would be even
lower than this value since the precision is dragged down by the large budget.

It is also worth noting that we only use 3 randomly picked seeds for all the test cases
on each dataset. Our algorithm requires far fewer seeds than other algorithms such as
[11].

The experiment has verified that our algorithm is able to achieve high accuracy on large
networks containing communities with an average size of roughly one hundred. This implies that our
approach is well-suited for the task of detecting small communities in large networks.


Footnote: Heat kernel implementation: https://www.cs.purdue.edu/homes/dgleich/codes/hkgrow

Figure 12: Comparison of the average F1 score with state-of-the-art local detection
algorithms on real networks.


Algorithm    Implementation   Amazon   DBLP    YouTube   Orkut
LEMON        Python           0.953    0.665   0.240     0.202
LEMON-auto   Python           0.910    0.525   0.190     0.170
DEMON        Python/C++       0.164    0.196   0.031     -
OSLOM        C++              0.766    0.542   -         -
LC           Python/C++       0.815    0.527   -         -


Table 6: Comparison of accuracy with global algorithms on real datasets. 


Algorithm    Implementation   Amazon     DBLP       YouTube   Orkut
LEMON        Python           <15s       <15s       <15s      <15s
LEMON-auto   Python           <15s       <15s       <15s      <15s
DEMON        Python/C++       4,562s     727,675s   22,395s   -
OSLOM        C++              885,867s   23,262s    >10d      -
LC           Python/C++       4,606s     49,045s    >10d      -


Table 7: Comparison of running time with global algorithms.

6.2 Comparison with Global Algorithms

We also compare local spectral clustering with several state-of-the-art globally based
algorithms.


1. OSLOM [13]: OSLOM is based on the optimization of a fitness function expressing the
statistical significance of clusters with respect to random fluctuations (i.e., the random graph
generated by the configuration model m during community expansion). The worst
case running time of OSLOM is $O(n^2)$.

2. DEMON [6]: The DEMON algorithm adopts a local-first approach for finding communities. It
democratically lets each vertex vote for the communities it sees surrounding it in
its limited view of the global system using a label propagation algorithm, and then
merges the local communities into a global collection.

3. LC [2]: Link Community (LC) is a global partitioning algorithm that first builds a
hierarchical link dendrogram according to the link similarity and then cuts the dendrogram
at some threshold to yield link communities. The time complexity is $O(n k_{max}^2)$, where
$k_{max}$ is the maximum vertex degree in the network.


Tables 6 and 7 summarize the average F1 score as well as the running time of each
algorithm on real datasets. Among the baselines, OSLOM and LC fail to terminate within 10
days on the YouTube dataset. The OSLOM algorithm can achieve rather good performance
but does not scale well.

In contrast, our algorithm consistently returns the result within a few seconds
irrespective of how large the entire graph is.

Besides, our algorithm has small memory consumption: a machine with 4GB RAM
can process networks as large as Orkut since the algorithm does not have to
store the whole graph in memory. Moreover, our locally based algorithm is parallelizable
because each seed set expansion can be computed independently. Such a property can bring
a further reduction in running time with a multi-threaded implementation [25].
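
Because the expansions are independent, they can be dispatched to a worker pool with no
coordination between tasks. The sketch below illustrates the idea; `expand_all` and `expand`
are hypothetical names, with `expand` standing for whatever routine grows a single seed set
into a community.

```python
from concurrent.futures import ThreadPoolExecutor


def expand_all(seed_sets, expand, workers=4):
    """Run `expand` on every seed set concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(expand, seed_sets))
```

In CPython, threads mainly help when `expand` spends its time in I/O or in native code that
releases the interpreter lock; for pure-Python numerical work a process pool would probably
be the better fit.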

Figures 13 and 14 compare the average F1 score with some state-of-the-art algorithms
on LFR benchmark graphs. During the experiments, we also incorporate the methods
that can effectively improve the performance on synthetic datasets, as discussed in
Section 5.3. We notice that our algorithm outperforms the baseline algorithms even when
we use the random seeding strategy. When the mixing parameter $\mu = 0.3$, as shown in
Figures 13 and 14, Lemon brings about a 30% to 40% relative improvement compared with
the best results among the baselines. We can expect the performance gain to be even
more significant if the seeds possess the qualities discussed in Section 5.1.

Footnotes:
OSLOM implementation: http://www.oslom.org/software.htm
DEMON implementation: http://www.michelecoscia.com/?page_id=42
LC implementation: https://github.com/bagrow/linkcomm

Figure 13: Comparison of the average F1 score on LFR datasets ($\mu = 0.1$) with baseline
algorithms.

Figure 14: Comparison of the average F1 score on LFR datasets ($\mu = 0.3$) with baseline
algorithms.

Among the four baselines, we notice that LC and DEMON consistently perform poorly
on both groups of synthetic datasets. We further look into the communities found by
LC and DEMON respectively, and find that LC tends to partition the graphs into very
small pieces while DEMON, on the contrary, usually finds communities that are much larger
than the ground truth communities. This implies that both algorithms extract structures
from networks that bear little resemblance to the natural formation of the communities.
However, we remark that even though LC fails to recognize the communities well on the
synthetic data, it performs better on real datasets, as we see in Table 6.


6.3 Empirical Comparison Between Synthetic and Real Data 

Networks are not all similar, and we cannot assume that an algorithm that works for finding
communities in one network will behave the same on other networks. Therefore, it is important
to develop an understanding of how different types of networks affect the behavior of
algorithms.

Although our algorithm sustains consistent performance on both LFR benchmark graphs and 
real networks, we still want to summarize and call attention to several subtle 
differences here. 

First, Lemon is less sensitive to the random walk step k and the subspace dimension l 
on real networks than on LFR benchmark graphs. In practice, fixing (k, l) to (3, 3) 
for real networks ensures good performance. 

Second, Lemon is less sensitive to the seed set size on real networks than on the LFR 
benchmark. In practice, a seed set size of 3 ensures good performance on real 
networks. For LFR, we set the seed set size proportional to the community size 
(8%). 

Third, Lemon is more sensitive to high-degree seeds on real networks than on the LFR 
benchmark. In LFR graphs, the degree of a vertex is at most 50, whereas in some large 
real networks such as YouTube, the degree of some vertices exceeds 1000, making the 
degree distribution much more skewed than that of LFR graphs. We expect vertices with 
unusually high degree in real networks to have a stronger influence on how the 
probabilities spread out during the random walk, and thus a higher risk of leaking 
probability into neighboring communities. Such an effect can be counterbalanced by 
putting less initial probability on these “super cores”. 
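One way to picture this counterbalancing is sketched below. The text does not specify how the initial probabilities are reduced, so the inverse-degree weighting here is only an assumed, illustrative choice, and the function name and example degrees are hypothetical.

```python
def seed_distribution(seeds, degree, discount=True):
    """Initial random-walk probabilities over the seed set.

    With discount=True each seed is weighted by 1/degree (an assumption,
    not a rule taken from the text), so high-degree seeds start with
    less probability mass. Seeds are assumed to have degree >= 1.
    """
    if discount:
        weights = {s: 1.0 / degree[s] for s in seeds}
    else:
        weights = {s: 1.0 for s in seeds}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


# Hypothetical seed set containing one "super core" of degree 1000.
degree = {"a": 4, "b": 5, "hub": 1000}
p0 = seed_distribution(["a", "b", "hub"], degree)
uniform = seed_distribution(["a", "b", "hub"], degree, discount=False)
```

Under this weighting the hub receives far less starting mass than the two ordinary seeds, while the uniform variant gives every seed one third.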

The above empirical analysis suggests that, for our algorithm, finding communities in 
real networks requires less parameter tuning than finding them in synthetic datasets. 
This indicates that, in practice, our algorithm is better suited for uncovering 
naturally well-formed communities than artificially constructed ones. 
7 Conclusion 


The problem of identifying small community structure in large networks has been gaining 
importance. 

In this paper, we have presented a method for finding overlapping communities by seeking 
a sparse vector in the span of the local spectra such that the seeds are in its support. 
To overcome the drawbacks of traditional spectral clustering methods, we propose a novel 
method to construct the local spectra based on singular vector approximations drawn from 
short random walks. Our algorithm finds a small community in time that depends on the 
size of the community rather than that of the entire graph, and it consistently returns 
the result within seconds even for a network with billions of vertices. We demonstrate 
the effectiveness and efficiency of our method for discovering communities on both 
synthetic and real-world datasets. As the experimental results show, our algorithm 
achieves the highest detection accuracy among the state-of-the-art proposals. 
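The core idea summarized above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the exact procedure of the paper: it builds the subspace from l consecutive vectors of a short random walk with added self-loops, orthonormalizes them with a QR factorization, and then solves a linear program that minimizes the one norm of a nonnegative vector y in that span with y at least 1 on the seeds. The function names, the toy graph, and the use of `scipy.optimize.linprog` are choices made for this example.

```python
import numpy as np
from scipy.optimize import linprog


def lemon_sketch(adj, seeds, k=3, l=3):
    """Sparse nonnegative vector in a local random-walk subspace (simplified)."""
    n = adj.shape[0]
    a = adj + np.eye(n)                      # self-loops keep mass on the seeds
    walk = a / a.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    p = np.zeros(n)
    p[seeds] = 1.0 / len(seeds)              # uniform start on the seed set
    for _ in range(k):                       # k short random-walk steps
        p = walk.T @ p
    vectors = [p]
    for _ in range(l - 1):                   # l consecutive walk vectors
        vectors.append(walk.T @ vectors[-1])
    basis, _ = np.linalg.qr(np.column_stack(vectors))  # orthonormal, n x l

    # Minimize sum(y) with y = basis @ x, subject to y >= 0 and y[seeds] >= 1.
    # Because y is nonnegative, sum(y) equals its one norm.
    lower = np.zeros(n)
    lower[seeds] = 1.0
    result = linprog(c=basis.sum(axis=0), A_ub=-basis, b_ub=-lower,
                     bounds=[(None, None)] * l)
    if not result.success:
        raise RuntimeError(result.message)
    return basis @ result.x


def two_cliques(size=5):
    """Toy graph: two cliques joined by a single edge."""
    n = 2 * size
    adj = np.zeros((n, n))
    adj[:size, :size] = 1.0
    adj[size:, size:] = 1.0
    np.fill_diagonal(adj, 0.0)
    adj[size - 1, size] = adj[size, size - 1] = 1.0
    return adj


seeds = [0, 1]
y = lemon_sketch(two_cliques(), seeds)
ranking = np.argsort(-y)  # vertices with the largest entries are candidate members
```

Every step touches only the vectors produced by the short walk, which is what keeps the work tied to the neighborhood of the seeds; the dense matrices here are only for readability on a toy graph.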

Many other fundamentally important research questions remain to be addressed. First, 
the community detection algorithm based on local spectral clustering could potentially 
be applied to the membership detection problem, i.e., finding all the communities that 
an arbitrary vertex belongs to. Second, during the process of seed set expansion, we 
adopt the first low-conductance community as the target community, which usually closely 
resembles the ground truth community. It would also be interesting to look further 
into some larger low-conductance communities and see whether a hierarchical structure 
exists. In that case, a large social group consisting of several small cliques is likely 
to be discovered. 
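The selection rule mentioned above can be illustrated with a small sweep over a ranked vertex list. The sketch assumes the usual definition of conductance, cut(S) / min(vol(S), vol(V \ S)), and treats every local minimum along the sweep as a candidate: the first one plays the role of the target community, while any later ones correspond to the larger low-conductance sets discussed as future work. The function names and the toy graph are hypothetical.

```python
def conductance(adj, members):
    """cut(S, rest) / min(vol(S), vol(rest)) for an undirected graph."""
    members = set(members)
    cut = sum(1 for u in members for v in adj[u] if v not in members)
    vol_in = sum(len(adj[u]) for u in members)
    vol_out = sum(len(adj[u]) for u in adj) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else 1.0


def local_minima(adj, ranking):
    """(prefix size, conductance) for each local minimum along the sweep."""
    scores = [conductance(adj, ranking[:i]) for i in range(1, len(ranking))]
    found = []
    for i in range(1, len(scores) - 1):
        if scores[i] < scores[i - 1] and scores[i] <= scores[i + 1]:
            found.append((i + 1, scores[i]))  # scores[i] belongs to prefix size i + 1
    return found


# Toy graph: two 4-cliques joined by the edge (3, 4), as adjacency lists.
adj = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
ranking = [0, 1, 2, 3, 4, 5, 6, 7]
minima = local_minima(adj, ranking)
size, score = minima[0]
community = ranking[:size]
```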